Assignment 7 - Dataset Analysis Preliminary Work

Introduction

In this report, we first work with different methods for projecting datasets described by multidimensional feature vectors down to two dimensions with as little loss of information as possible. Reducing the dataset to two dimensions is especially convenient for visualization, since we can observe the cluster structures at a glance and form an initial impression of the data.

Next, we will apply different clustering methods with different hyperparameters, such as the number of clusters and the linkage criterion, to our dataset. We will then compare the cluster structures suggested by the visualizations from the different projection methods with the actual clusters produced by the clustering methods, and check whether our interpretations are consistent.

Finally, using clustering evaluation and validation methods, we will assess whether the cluster structures created in the projection and clustering sections are the right choices for our dataset, and we will present the final version of the clustering, with a better algorithm and hyperparameters if one is found. We will use external, internal, and relative criteria for this validity assessment.

Data

For the final report, we chose the dataset ‘Pokemon with stats’ from Kaggle (https://www.kaggle.com/abcsds/pokemon).

The dataset stores 800 Pokemon, including their name, first and second type, and basic stats: HP, Attack, Defense, Special Attack, Special Defense, and Speed. These statistics are derived from the values used in the Pokemon games. The dataset consists of 13 columns: 9 numerical (integer), 3 string, and 1 boolean.

setwd("C:/Users/USER/Desktop/R Tutorials/Dataset Selection")
df = read.csv("Pokemon.csv")
head(df)
##   X.                  Name Type.1 Type.2 Total HP Attack Defense Sp..Atk
## 1  1             Bulbasaur  Grass Poison   318 45     49      49      65
## 2  2               Ivysaur  Grass Poison   405 60     62      63      80
## 3  3              Venusaur  Grass Poison   525 80     82      83     100
## 4  3 VenusaurMega Venusaur  Grass Poison   625 80    100     123     122
## 5  4            Charmander   Fire          309 39     52      43      60
## 6  5            Charmeleon   Fire          405 58     64      58      80
##   Sp..Def Speed Generation Legendary
## 1      65    45          1     False
## 2      80    60          1     False
## 3     100    80          1     False
## 4     120    80          1     False
## 5      50    65          1     False
## 6      65    80          1     False

Let’s check for missing values.

sum(is.na(df))
## [1] 0

As you can see, there are no missing values.
Now, let’s look at a summary of the data.

summary(df)
##        X.            Name              Type.1             Type.2         
##  Min.   :  1.0   Length:800         Length:800         Length:800        
##  1st Qu.:184.8   Class :character   Class :character   Class :character  
##  Median :364.5   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :362.8                                                           
##  3rd Qu.:539.2                                                           
##  Max.   :721.0                                                           
##      Total             HP             Attack       Defense      
##  Min.   :180.0   Min.   :  1.00   Min.   :  5   Min.   :  5.00  
##  1st Qu.:330.0   1st Qu.: 50.00   1st Qu.: 55   1st Qu.: 50.00  
##  Median :450.0   Median : 65.00   Median : 75   Median : 70.00  
##  Mean   :435.1   Mean   : 69.26   Mean   : 79   Mean   : 73.84  
##  3rd Qu.:515.0   3rd Qu.: 80.00   3rd Qu.:100   3rd Qu.: 90.00  
##  Max.   :780.0   Max.   :255.00   Max.   :190   Max.   :230.00  
##     Sp..Atk          Sp..Def          Speed          Generation   
##  Min.   : 10.00   Min.   : 20.0   Min.   :  5.00   Min.   :1.000  
##  1st Qu.: 49.75   1st Qu.: 50.0   1st Qu.: 45.00   1st Qu.:2.000  
##  Median : 65.00   Median : 70.0   Median : 65.00   Median :3.000  
##  Mean   : 72.82   Mean   : 71.9   Mean   : 68.28   Mean   :3.324  
##  3rd Qu.: 95.00   3rd Qu.: 90.0   3rd Qu.: 90.00   3rd Qu.:5.000  
##  Max.   :194.00   Max.   :230.0   Max.   :180.00   Max.   :6.000  
##   Legendary        
##  Length:800        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Although there are 800 instances in the dataset, there are only 721 distinct IDs, because Pokemon that are evolved or alternate forms of one another are stored under the same ID.

Finally, let’s observe the correlation between the variables.

numeric_df = df[sapply(df, is.numeric)]
corr_matrix = cor(numeric_df)
corr_matrix
##                    X.      Total         HP     Attack    Defense    Sp..Atk
## X.         1.00000000 0.11981278 0.09761387 0.10229779 0.09478582 0.08875893
## Total      0.11981278 1.00000000 0.61874835 0.73621065 0.61278743 0.74724986
## HP         0.09761387 0.61874835 1.00000000 0.42238603 0.23962232 0.36237986
## Attack     0.10229779 0.73621065 0.42238603 1.00000000 0.43868706 0.39636176
## Defense    0.09478582 0.61278743 0.23962232 0.43868706 1.00000000 0.22354861
## Sp..Atk    0.08875893 0.74724986 0.36237986 0.39636176 0.22354861 1.00000000
## Sp..Def    0.08581650 0.71760947 0.37871807 0.26398955 0.51074659 0.50612142
## Speed      0.01073349 0.57594266 0.17595206 0.38123974 0.01522660 0.47301788
## Generation 0.98251567 0.04838402 0.05868251 0.05145134 0.04241857 0.03643683
##               Sp..Def       Speed  Generation
## X.         0.08581650  0.01073349  0.98251567
## Total      0.71760947  0.57594266  0.04838402
## HP         0.37871807  0.17595206  0.05868251
## Attack     0.26398955  0.38123974  0.05145134
## Defense    0.51074659  0.01522660  0.04241857
## Sp..Atk    0.50612142  0.47301788  0.03643683
## Sp..Def    1.00000000  0.25913311  0.02848599
## Speed      0.25913311  1.00000000 -0.02312106
## Generation 0.02848599 -0.02312106  1.00000000
library(corrplot)
## corrplot 0.90 loaded
corrplot(corr_matrix)

Let’s keep only the numerical features, excluding the ID feature, to make our dataset suitable for PCA, MDS, and clustering.

numDF = df[c(5,6,7,8,9,10,11,12)]
head(numDF,5)
##   Total HP Attack Defense Sp..Atk Sp..Def Speed Generation
## 1   318 45     49      49      65      65    45          1
## 2   405 60     62      63      80      80    60          1
## 3   525 80     82      83     100     100    80          1
## 4   625 80    100     123     122     120    80          1
## 5   309 39     52      43      60      50    65          1

Methods and Code

PCA - Principal Component Analysis

In this part, we used the video: https://www.youtube.com/watch?v=0Jp4gsfOLMs. [1]

We first apply PCA to the dataset with the prcomp() function. Then, we visualize it according to the two principal components with the highest variation.

pca = prcomp(numDF, scale=TRUE)
plot(pca$x[,1], pca$x[,2], xlab="PC1", ylab="PC2", main="PCA")

Let’s draw a Scree Plot to see how much variation each principal component contains.

pca.var = pca$sdev^2
pca.var.per = round(pca.var/sum(pca.var)*100,1)
barplot(pca.var.per, main="Scree Plot", xlab="PC", ylab="Percent Variation")

It’s important to use the summary() function to observe the Cumulative Proportion of Variance of principal components.

summary(pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.9269 1.0536 0.9934 0.88030 0.84869 0.65380 0.51715
## Proportion of Variance 0.4641 0.1388 0.1234 0.09686 0.09003 0.05343 0.03343
## Cumulative Proportion  0.4641 0.6029 0.7262 0.82310 0.91314 0.96657 1.00000
##                              PC8
## Standard deviation     1.726e-15
## Proportion of Variance 0.000e+00
## Cumulative Proportion  1.000e+00

In order for the cumulative proportion to exceed 0.9, we must use 5 components.
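This 0.9 rule can also be applied programmatically rather than read off the summary. A minimal sketch on synthetic data; the same cumsum/which pattern applies to the pca.var.per vector computed above:

```r
# Sketch: pick the smallest number of components whose cumulative
# explained variance reaches 90% (synthetic data).
set.seed(1)
toy = matrix(rnorm(100 * 6), ncol = 6)
toy.pca = prcomp(toy, scale = TRUE)
var.per = toy.pca$sdev^2 / sum(toy.pca$sdev^2) * 100
k = which(cumsum(var.per) >= 90)[1]
k  # smallest k with at least 90% cumulative variance
```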

Let’s create a more visually satisfying graph using the ggplot library.

library(ggplot2)
pca.data = data.frame(X=pca$x[,1], Y=pca$x[,2])
ggplot(data=pca.data, aes(x=X, y=Y, label="+")) + geom_text() + xlab(paste("PC1 - ", pca.var.per[1], "%", sep="")) + ylab(paste("PC2 - ", pca.var.per[2], "%", sep="")) + theme_bw() + ggtitle("PCA Graph")

Finally, since the principal components are in fact linear combinations of the original features, let’s examine the weight with which each feature enters the two components used in the visualization.

loading_scores = pca$rotation[,1]
loading_scores = abs(loading_scores)
ld_ranked = sort(loading_scores, decreasing = TRUE)
pca$rotation[names(ld_ranked),1]
##      Total    Sp..Atk    Sp..Def     Attack         HP    Defense      Speed 
## 0.51853234 0.38980094 0.37971841 0.37801677 0.32945198 0.31314747 0.29027307 
## Generation 
## 0.03518934
loading_scores2 = pca$rotation[,2]
loading_scores2 = abs(loading_scores2)
ld_ranked2 = sort(loading_scores2, decreasing = TRUE)
pca$rotation[names(ld_ranked2),2]
##        Speed      Defense   Generation      Sp..Atk      Sp..Def           HP 
##  0.639859006 -0.571057474 -0.362921024  0.283525992 -0.203158632 -0.104395366 
##        Total       Attack 
##  0.014201859 -0.001280817

MDS - Multidimensional Scaling

PCoA with Manhattan Distance

In the following two parts, we used the video: https://www.youtube.com/watch?v=pGAUHhLYp5Q. [2]

First, we will apply classical MDS, also known as Principal Coordinate Analysis (PCoA or PCO). We avoid Euclidean distance as the metric because we know it would give the same result as PCA; we therefore use Manhattan distance instead. We used the cmdscale() function for this, and then examined how much variation the first 5 coordinates carry.
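The claim that classical MDS with Euclidean distance reproduces PCA can be checked on a small synthetic example; the coordinates agree up to sign flips of the axes:

```r
# Sketch: cmdscale() on Euclidean distances of centered data returns the
# same coordinates as prcomp() scores, up to the sign of each axis
# (synthetic data, since the real dataset is loaded elsewhere).
set.seed(1)
toy = scale(matrix(rnorm(50 * 4), ncol = 4))
pca.scores = prcomp(toy)$x[, 1:2]
mds.points = cmdscale(dist(toy), k = 2)
same = sapply(1:2, function(j)
  isTRUE(all.equal(abs(pca.scores[, j]), abs(mds.points[, j]))))
all(same)
```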

dist.matrix = dist(scale(numDF, center=TRUE, scale=TRUE), method="manhattan")
mds = cmdscale(dist.matrix, eig=TRUE, x.ret=TRUE)
mds.var.per = round(mds$eig/sum(mds$eig)*100,1)
mds.var.per[1:5]
## [1] 54.9 14.6 12.9  9.1  7.5

In order for the cumulative proportion to exceed 0.9, we must use 4 components.

Here is the distribution of the data according to the MDS1 and MDS2 after the Principal Coordinate Analysis with Manhattan distance.

mds.values = mds$points
mds.data = data.frame(X=mds.values[,1], Y=mds.values[,2])
ggplot(data=mds.data, aes(x=X,y=Y, label="+")) + geom_text() + theme_bw() + xlab(paste("MDS1 - ", mds.var.per[1], "%", sep="")) + ylab(paste("MDS2 - ", mds.var.per[2], "%", sep="")) + ggtitle("MDS Plot Using Manhattan Distance")

PCoA with avg(logFC) Distance

We saw this method in a video from the YouTube channel StatQuest with Josh Starmer (https://www.youtube.com/watch?v=pGAUHhLYp5Q). With this distance there was a large increase in the variation captured by the first two components, and since we needed exactly that increase, we thought it appropriate to try the method. But because the dist() function offers no such distance, we had to compute the values manually: we created a matrix filled with zeros and then filled its lower triangle.

log2.numDF = log2(numDF)
log2.dist = matrix(0, nrow=nrow(log2.numDF), ncol=nrow(log2.numDF))
for(i in 1:nrow(log2.dist)) {
  for(j in 1:i) {
    log2.dist[i,j] <- rowMeans(abs(log2.numDF[i,] - log2.numDF[j,]))
  }
}
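A note on this loop: the mean absolute log2 difference is just the Manhattan distance on the log2-transformed data divided by the number of columns, so dist() can produce the same matrix without the nested R-level loop. A sketch on toy data; the real log2.numDF would be used the same way:

```r
# Sketch: mean(|log2(a) - log2(b)|) equals the Manhattan distance of the
# log2 rows divided by the number of columns, so the nested loop can be
# replaced by a single dist() call (toy data shown here).
set.seed(1)
log2.toy = log2(matrix(runif(20 * 4, min = 1, max = 100), ncol = 4))
# Loop version, as in the report
loop.dist = matrix(0, nrow = nrow(log2.toy), ncol = nrow(log2.toy))
for (i in 1:nrow(loop.dist))
  for (j in 1:i)
    loop.dist[i, j] = mean(abs(log2.toy[i, ] - log2.toy[j, ]))
# Vectorized version
vec.dist = as.matrix(dist(log2.toy, method = "manhattan")) / ncol(log2.toy)
all.equal(loop.dist[lower.tri(loop.dist)], vec.dist[lower.tri(vec.dist)])
```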

We completed the analysis by applying the same procedures as the previous method.

pco = cmdscale(as.dist(log2.dist), eig=TRUE, x.ret=TRUE)
pco.var.per = round(pco$eig/sum(pco$eig)*100,1)
pco.var.per[1:10]
##  [1] 50.6 20.8 12.8  9.0  6.5  5.6  5.2  3.4  2.8  2.5

In order for the cumulative proportion to exceed 0.9, we must use 4 components.

pco.values = pco$points
pco.data = data.frame(X=pco.values[,1], Y=pco.values[,2])
ggplot(data=pco.data, aes(x=X,y=Y, label="+")) + geom_text() + theme_bw() + xlab(paste("MDS1 - ", pco.var.per[1], "%", sep="")) + ylab(paste("MDS2 - ", pco.var.per[2], "%", sep="")) + ggtitle("MDS Plot Using avg(logFC) as the Distance")

Kruskal’s Non-metric Multidimensional Scaling

We couldn’t find as many resources on implementing this method as on the others, so our analysis here was a bit more limited. We relied on the R documentation while writing this code. [3]

However, we ran into a problem when applying this function: instances at zero or negative distance from one another caused an error. Since we used Euclidean distance, a negative distance between two samples is impossible, so we suspected the offending distances were zero. When we examined, in the original numDF data frame, the samples reported to be at zero distance, we noticed that once the non-numeric columns were dropped they were duplicates of each other. That’s why we removed the duplicate rows before applying this method.
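This debugging step can be sketched as follows: locate the zero-distance pairs in the distance matrix and confirm they correspond to duplicated rows. Toy data with an obvious duplicate is used here; the same pattern applies to numDF:

```r
# Sketch: find pairs of observations at distance zero, i.e. duplicated
# feature vectors, which is what breaks isoMDS() (toy data here).
toy = rbind(c(1, 2), c(3, 4), c(1, 2))   # rows 1 and 3 are duplicates
d = as.matrix(dist(toy))
diag(d) = NA                              # ignore self-distances
zero.pairs = which(d == 0, arr.ind = TRUE)
zero.pairs                                # pair (1, 3), listed twice by symmetry
dedup = toy[!duplicated(toy), ]           # drop duplicates, as before isoMDS
nrow(dedup)                               # 2 rows remain
```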

library(MASS)
notDup = numDF[!duplicated(numDF),]
dist2 = dist(scale(notDup, center=TRUE, scale=TRUE), method="euclidean")
iso = isoMDS(dist2, y=cmdscale(dist2,2), k=2, maxit=20, p=2)
## initial  value 24.979582 
## iter   5 value 20.034201
## iter  10 value 18.062551
## final  value 17.807568 
## converged

Even though we set the maximum number of iterations to 20, isoMDS() converged to a stress of about 17.8 after the 10th iteration. This is quite a large value, and a much smaller stress would indicate a better fit; however, the analyses above already showed that the dataset does not lend itself very well to multidimensional scaling.

iso.values = iso$points
iso.data = data.frame(X=iso.values[,1], Y=iso.values[,2])
ggplot(data=iso.data, aes(x=X,y=Y, label="+")) + geom_text() + theme_bw() + xlab(paste("ISO1")) + ylab(paste("ISO2")) + ggtitle("Non-metric MDS")

Hierarchical Clustering

For the clustering part, we will first apply hierarchical clustering to the dataset, trying different linkage methods. Let’s start with single-link.

For this stage, we first scaled our dataset, created a distance matrix (distances) using Euclidean distance, and hierarchically clustered the dataset with this matrix and the hclust() function. You can see the resulting dendrogram below.

scaledNumDF = scale(numDF)
distances = dist(scaledNumDF, method="euclidean")
hierClustMin = hclust(distances, method="single")
plot(hierClustMin, main="Hierarchical Clustering with Single Link")

As stated in the course materials, single-link hierarchical clustering is very sensitive to outliers and noise. As the plots in the PCA and MDS parts showed, although the samples in the dataset form clustered structures, they also spread widely outside those clusters, so the dataset does contain outliers. Since the dendrogram is too cramped and complex to read with the naked eye, we do not want to dwell on this method any longer and move on to the next linkage.

Next, we implemented hierarchical clustering with complete-linkage.

hierClustMax = hclust(distances, method="complete")
plot(hierClustMax, main="Hierarchical Clustering with Complete Link")

In the plots from the PCA and MDS sections we observed 3 to 5 clusters in the dataset. Looking at this dendrogram, choosing 5 clusters seems appropriate. One of these clusters consists solely of the sample at index 231, so we can regard it as a stand-alone outlier cluster.

Let’s see how our dataset looks when we divide it into 5 clusters. We used the factoextra library for this.

library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
hierarchicalClusters = cutree(hierClustMax, k=5)
fviz_cluster(list(data=numDF, cluster=hierarchicalClusters))

The course materials note that complete-link hierarchical clustering is biased toward globular clusters, which is why it tends to break up large clusters, and this is visible in the figure: there is very heavy overlap between the clusters. Part of this may be the information lost when a multidimensional feature vector is reduced to two dimensions, but as a visual result it is still unsatisfactory.

Afterwards, we implemented hierarchical clustering with group average.

hierClustAvg = hclust(distances, method="average")
plot(hierClustAvg, main="Hierarchical Clustering with Group Average")

Just as with single-link, the dendrogram of the group-average method is not very suitable for choosing the number of clusters, so we skip this method as well.

Finally, we implemented hierarchical clustering with Ward’s method.

hierClustWard = hclust(distances, method="ward.D2")
plot(hierClustWard, main="Hierarchical Clustering with Ward's Method")

Ward’s method gives a much clearer dendrogram than the previous three, so we will focus on it rather than the others. As mentioned before, the plots in the PCA and MDS sections suggested choosing between 3 and 5 clusters, so let’s examine these two values with this clustering method.

plot(hierClustWard, main="Hierarchical Clustering with Ward's Method")
rect.hclust(hierClustWard, k=3, border = 2:4)

hierarchicalClusters2 = cutree(hierClustWard, k=3)
tapply(numDF$Total, hierarchicalClusters2, mean)
##        1        2        3 
## 311.1800 484.1952 632.7882

Since the feature Total is the sum of the numeric values of all the other features except Generation, it summarizes each sample in a rough but convenient way. We therefore looked at the mean of Total within each cluster. On this basis, the samples appear to split into clusters of “weak”, “moderate”, and “strong” Pokemon according to their in-game stats.
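The claim about Total can be spot-checked against the rows printed by head(df) in the Data section, e.g. Bulbasaur and Ivysaur:

```r
# Spot check: Total should equal the sum of the six battle stats.
# Values copied from the head(df) output shown earlier.
bulbasaur = c(HP = 45, Attack = 49, Defense = 49,
              Sp..Atk = 65, Sp..Def = 65, Speed = 45)
ivysaur = c(HP = 60, Attack = 62, Defense = 63,
            Sp..Atk = 80, Sp..Def = 80, Speed = 60)
sum(bulbasaur)  # 318, matching Bulbasaur's Total
sum(ivysaur)    # 405, matching Ivysaur's Total
```

On the full data frame, the same idea amounts to comparing df$Total against the rowSums of the six stat columns.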

Let’s visualize these clusters.

fviz_cluster(list(data=numDF, cluster=hierarchicalClusters2))

There is still a large amount of overlap in the clusters, but it is a more acceptable result than the previous clusters.

Let’s repeat the same process with 5 clusters.

plot(hierClustWard, main="Hierarchical Clustering with Ward's Method")
rect.hclust(hierClustWard, k=5, border = 2:6)

hierarchicalClusters3 = cutree(hierClustWard, k=5)
fviz_cluster(list(data=numDF, cluster=hierarchicalClusters3))

Even allowing for the fact that this two-dimensional representation captures only 60.3% of the dataset’s variation, such heavily overlapping clusters are unfortunately not useful for forming an idea about the data.

K-means Clustering

Before moving on to k-means clustering, let’s first draw an elbow plot to choose an appropriate k and to see the WSS value at each k.

fviz_nbclust(scaledNumDF, kmeans, method="wss") + labs(subtitle = "Elbow Plot")

As the plot shows, the break point of the elbow method occurs at k=2. For this reason, we will try cluster counts starting from 2 and continuing up to 6.
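The curve behind fviz_nbclust can also be computed by hand, which makes the “WSS at each k” explicit. A minimal sketch on synthetic two-cluster data; the same sapply applies to scaledNumDF:

```r
# Sketch: total within-cluster sum of squares (WSS) for k = 1..6, the
# quantity plotted by the elbow method (synthetic two-cluster data).
set.seed(2021)
toy = rbind(matrix(rnorm(50, mean = 0), ncol = 2),
            matrix(rnorm(50, mean = 5), ncol = 2))
wss = sapply(1:6, function(k) kmeans(toy, k, nstart = 10)$tot.withinss)
wss  # drops sharply from k = 1 to k = 2, then flattens: the "elbow"
```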

set.seed(2021)
kmeansCluster2 = kmeans(numDF, 2, nstart = 5)
fviz_cluster(list(data=numDF, cluster=kmeansCluster2$cluster))

Although the clustering does a fairly good job, we think that 2 clusters would be too few classes for a useful Pokemon segmentation.

Let’s change the initialization and see how it will affect clustering.

kmeansCluster2_2 = kmeans(numDF, 2, nstart = 25)
fviz_cluster(list(data=numDF, cluster=kmeansCluster2_2$cluster))

When we changed the initialization, we saw only very small changes along the boundary where the clusters touch.

Let’s change the number of clusters to 3 and see what changes.

kmeansCluster3 = kmeans(numDF, 3, nstart = 5)
fviz_cluster(list(data=numDF, cluster=kmeansCluster3$cluster))

The results are much easier to interpret than those of the hierarchical clusterings.

kmeansCluster3_2 = kmeans(numDF, 3, nstart = 35)
fviz_cluster(list(data=numDF, cluster=kmeansCluster3_2$cluster))

We changed the initialization once again and saw almost no difference. Therefore, we stop varying the initialization and examine only the effect of the number of clusters.

Let’s set the number of clusters to 4.

kmeansCluster4 = kmeans(numDF, 4, nstart = 5)
fviz_cluster(list(data=numDF, cluster=kmeansCluster4$cluster))

Let’s try k=5.

kmeansCluster5 = kmeans(numDF, 5, nstart = 25)
fviz_cluster(list(data=numDF, cluster=kmeansCluster5$cluster))

Lastly, we want to try k=6, since we think that the rightmost cluster in the figure should also be divided into two clusters.

kmeansCluster6 = kmeans(numDF, 6, nstart = 25)
fviz_cluster(list(data=numDF, cluster=kmeansCluster6$cluster))

K-means clustering did a fairly good job in the visualizations. Although the elbow method points to k=2 as the most appropriate value, we can reasonably work with more clusters.

Cluster Validation

In the previous part we tentatively concluded that the most appropriate choice was k-means clustering with k=6. We will now test this decision with different cluster validation methods and settle on the cluster structure of our dataset using calculations more reliable than the human eye.

Although the course materials distinguish internal, external, and relative indices for this topic, we could not find such a clear-cut separation in the R functions and packages. We therefore thought it appropriate to apply several of these methods and metrics together.

We used the clValid library to compute the internal consistency values and stability measures. Since we use hierarchical and k-means clustering and usually vary the number of clusters between 3 and 6, we called the clValid() function with parameters matching these choices.

library(clValid)
## Loading required package: cluster
clusterMethods = c("hierarchical", "kmeans")
internal = clValid(numDF, nClust = 3:6, clMethods = clusterMethods, maxitems = 1000, validation = "internal")
## Warning in clValid(numDF, nClust = 3:6, clMethods = clusterMethods, maxitems =
## 1000, : rownames for data not specified, using 1:nrow(data)
summary(internal)
## 
## Clustering Methods:
##  hierarchical kmeans 
## 
## Cluster sizes:
##  3 4 5 6 
## 
## Validation Measures:
##                                   3        4        5        6
##                                                               
## hierarchical Connectivity    7.9155  11.8984  29.2881  49.8254
##              Dunn            0.1535   0.1535   0.0673   0.0903
##              Silhouette      0.3688   0.3047   0.4145   0.4097
## kmeans       Connectivity   41.3099  63.8333 207.1782 211.1107
##              Dunn            0.0792   0.0715   0.0478   0.0478
##              Silhouette      0.4334   0.4004   0.2850   0.2875
## 
## Optimal Scores:
## 
##              Score  Method       Clusters
## Connectivity 7.9155 hierarchical 3       
## Dunn         0.1535 hierarchical 3       
## Silhouette   0.4334 kmeans       3

The Connectivity, Dunn, and Silhouette values obtained for each combination of clustering method and cluster count are shown above. According to the Connectivity and Dunn metrics, 3-cluster hierarchical clustering gives the best result, while the Silhouette metric favors k-means clustering with k=3.
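The Silhouette figure reported by clValid can be reproduced directly with cluster::silhouette, which also yields per-point widths rather than only the average. A sketch on toy data; the same call applies to a kmeans result on numDF together with the corresponding distance matrix:

```r
# Sketch: average silhouette width for a k-means clustering, computed
# with cluster::silhouette (toy two-cluster data).
library(cluster)
set.seed(1)
toy = rbind(matrix(rnorm(40, mean = 0), ncol = 2),
            matrix(rnorm(40, mean = 4), ncol = 2))
km = kmeans(toy, 2, nstart = 10)
sil = silhouette(km$cluster, dist(toy))
mean(sil[, "sil_width"])  # average silhouette width, in [-1, 1]
```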

plot(internal)

Now let’s examine the stability measures by changing the validation parameter of the clValid() function.

stab = clValid(numDF, nClust = 3:6, clMethods = clusterMethods, maxitems = 1000, validation = "stability")
## Warning in clValid(numDF, nClust = 3:6, clMethods = clusterMethods, maxitems =
## 1000, : rownames for data not specified, using 1:nrow(data)
summary(stab)
## 
## Clustering Methods:
##  hierarchical kmeans 
## 
## Cluster sizes:
##  3 4 5 6 
## 
## Validation Measures:
##                          3        4        5        6
##                                                      
## hierarchical APN    0.0580   0.2291   0.0517   0.0368
##              AD   169.3360 168.7530 119.2382 103.7467
##              ADM   13.2969  50.5055  20.7552  17.9379
##              FOM   36.7039  34.7147  33.3393  32.3068
## kmeans       APN    0.0544   0.2306   0.1353   0.1301
##              AD    97.9146  97.4735  86.6259  83.1601
##              ADM    8.6290  24.4035  15.5246  13.6436
##              FOM   25.9650  23.6649  23.4388  23.3290
## 
## Optimal Scores:
## 
##     Score   Method       Clusters
## APN  0.0368 hierarchical 6       
## AD  83.1601 kmeans       6       
## ADM  8.6290 kmeans       3       
## FOM 23.3290 kmeans       6

Looking at the optimal combinations under the stability criteria, k-means with 3 or 6 clusters accounts for most of the best scores.

Results

We summarize the results of the calculations across the different clustering algorithms and cluster counts.

vals = c("internal", "stability")
totalSummary = clValid(numDF, nClust = 3:6, clMethods = clusterMethods, maxitems = 1000, validation = vals)
## Warning in clValid(numDF, nClust = 3:6, clMethods = clusterMethods, maxitems =
## 1000, : rownames for data not specified, using 1:nrow(data)
summary(totalSummary)
## 
## Clustering Methods:
##  hierarchical kmeans 
## 
## Cluster sizes:
##  3 4 5 6 
## 
## Validation Measures:
##                                   3        4        5        6
##                                                               
## hierarchical APN             0.0580   0.2291   0.0517   0.0368
##              AD            169.3360 168.7530 119.2382 103.7467
##              ADM            13.2969  50.5055  20.7552  17.9379
##              FOM            36.7039  34.7147  33.3393  32.3068
##              Connectivity    7.9155  11.8984  29.2881  49.8254
##              Dunn            0.1535   0.1535   0.0673   0.0903
##              Silhouette      0.3688   0.3047   0.4145   0.4097
## kmeans       APN             0.0544   0.2306   0.1353   0.1301
##              AD             97.9146  97.4735  86.6259  83.1601
##              ADM             8.6290  24.4035  15.5246  13.6436
##              FOM            25.9650  23.6649  23.4388  23.3290
##              Connectivity   41.3099  63.8333 207.1782 211.1107
##              Dunn            0.0792   0.0715   0.0478   0.0478
##              Silhouette      0.4334   0.4004   0.2850   0.2875
## 
## Optimal Scores:
## 
##              Score   Method       Clusters
## APN           0.0368 hierarchical 6       
## AD           83.1601 kmeans       6       
## ADM           8.6290 kmeans       3       
## FOM          23.3290 kmeans       6       
## Connectivity  7.9155 hierarchical 3       
## Dunn          0.1535 hierarchical 3       
## Silhouette    0.4334 kmeans       3

As the summary above shows, 3-cluster hierarchical clustering and k-means with 3 and 6 clusters stand out among the optimal combinations. After comparing the scores on the metrics where these combinations were not optimal, we decided that the best choice for clustering our dataset is k-means with k=3. The most optimal clustering is therefore as follows.

fviz_cluster(list(data=numDF, cluster=kmeansCluster3$cluster), main = "K-means Clustering when k = 3")

Analysis and Evaluation of Results

In the visualizations from the PCA and MDS sections we obtained very similar plots, especially in terms of clusters, with one exception: the MDS performed with avg(logFC) produced a slightly different picture. Because all of these visualizations use a two-dimensional plane, we plotted the two components with the highest variation; but setting visualization aside, this is not the right way to choose the number of components. As stated in the lecture slides, the number of components k should be chosen so that the cumulative proportion of variance exceeds 0.9. So although we pick the method with the highest total variation in the first two components and use its plot to reason about clustering, the actual distribution may differ from it. That said, many sources report that even when the variation proportions are close to each other and the cumulative proportion of the first two components falls short of 0.9, a two-component visualization usually comes close to the real clusters.

Although the hierarchical clustering methods did not give good results, the clusters we created with k-means were satisfactory. However, looking at the x and y axes of the resulting plots, the fact that they represent only 60.3% of the total variation prevents us from drawing firm conclusions about the entire dataset.

pco.var.per[1] + pco.var.per[2]
## [1] 71.4

This situation suggested a solution. In one of the previous assignments, Projection of Data, we learned how to reduce multidimensional data to fewer dimensions. In the Principal Coordinate Analysis we applied with the avg(logFC) distance metric, the first two dimensions captured 71.4% of the variation. We therefore want to see how k-means clustering performs on this projected, lower-dimensional dataset.

kmeansPCO = kmeans(pco.data, 3, nstart=40)
fviz_cluster(list(data=pco.data, cluster=kmeansPCO$cluster), main = "K-means Clustering into 3 Clusters")

Further Methods

In the final paper, we will perform a non-linear projection with ISOMAP and density-based clustering with the DBSCAN algorithm for further analysis.

Discussion, Expectations, and Perspectives

Based on the projection, clustering, and validation methods we have used so far, the most appropriate algorithm and cluster count for our dataset is k-means clustering with 3 clusters. Although the new clustering algorithm we will use later may produce a different structure, we expect the best number of clusters to remain 3, because the validity results favored 3 clusters for both hierarchical and k-means clustering.

In addition, if the projection method we use next captures more than 71.4% of the variation in its first two dimensions, we plan to apply the most appropriate clustering method and cluster count to the dataset reduced with that projection.

To summarize, we expect that after our final work there will still be 3 clusters in the ‘Pokemon with stats’ dataset.

References

[1] StatQuest with Josh Starmer. (2017, November 7). StatQuest: PCA in R [Video]. YouTube. https://www.youtube.com/watch?v=0Jp4gsfOLMs

[2] StatQuest with Josh Starmer. (2017, December 19). StatQuest: MDS and PCoA in R [Video]. YouTube. https://www.youtube.com/watch?v=pGAUHhLYp5Q

[3] isoMDS. MASS package documentation. https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/isoMDS.html